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Abstract 

In the manifold learning problem one seeks to dis- 
cover a smooth low dimensional surface, i.e., a manifold 
embedded in a higher dimensional linear vector space, 
based on a set of measured sample points on the surface. 
In this paper we consider the closely related problem of 
estimating the manifold's intrinsic dimension and the in- 
trinsic entropy of the sample points. Specifically, we view 
the sample points as realizations of an unknown multivari- 
ate density supported on an unknown smooth manifold. We 
present a novel geometrical probability approach, called 
the geodesic-minimal-spanning-tree (GMST), to obtaining 
asymptotically consistent estimates of the manifold dimen- 
sion and the Renyi ct-entropy of the sample density on the 
manifold. The GMST approach is striking in its simplicity 
and does not require reconstructing the manifold or esti- 
mating the multivariate density of the samples. The GMST 
method simply constructs a minimal spanning tree (MST) 
sequence using a geodesic edge matrix and uses the over- 
all lengths of the MSTs to simultaneously estimate mani- 
fold dimension and entropy. We illustrate the GMST ap- 
proach for dimension and entropy estimation of a human 
face dataset. 

Keywords: Nonlinear dimensionality reduction, geomet- 
rical probability, minimal spanning trees, intrinsic alpha- 
entropy, global manifold learning, conformal embeddings. 



1 Introduction 

Consider a class of natural occurring signals, e.g., recorded 
speech, audio, images, or videos. Such signals typically 
have high extrinsic dimension, e.g., as characterized by the 
number of pixels in an image or the number of time sam- 
ples in an audio waveform. However, most natural signals 



have smooth and regular structure, e.g. piecewise smooth- 
ness, that permits substantial dimension reduction with lit- 
tle or no loss of content information. For support of this 
fact one needs only consider the success of image, video 
and audio compression algorithms, e.g. MP3, JPEG and 
MPEG, or the widespread use of efficient computational 
geometry methods for rendering smooth three dimensional 
shapes. 

A useful representation of a regular signal class is 
to model it as a set of vectors which are constrained to a 
smooth low dimensional manifold embedded in a high di- 
mensional vector space. This manifold may in some cases 
be a linear, i.e., Euclidean, subspace but in general it is a 
non-linear curved surface. A problem of substantial recent 
interest in machine learning, computer vision, signal pro- 
cessing and statistics |34|[lU|27l[lI>]|26||33 is the determi- 
nation of the so-called intrinsic dimension of the manifold 
and the reconstruction of the manifold from a set of sam- 
ples from the signal class. This problem falls in the area 
of manifold learning which is concerned with discovering 
low dimensional structure in high dimensional data. 

When the samples are drawn from a large popula- 
tion of signals one can interpret them as realizations from 
a multivariate distribution supported on the manifold. As 
this distribution is singular in the higher dimensional em- 
bedding space it has zero entropy as defined by the stan- 
dard Lebesgue integral over the embedding space. How- 
ever, when defined as a Lebesgue integral restricted to 
the lower dimensional manifold the entropy can be finite. 
This finite "intrinsic" entropy can be useful for for ex- 
ploring data compression over the manifold or, as sug- 
gested in [21 1, clustering of multiple sub-populations on 
the manifold. The question that we address in this paper 
is: how to simultaneously estimate the intrinsic dimension 
and intrinsic entropy on the manifold given a set of random 
sample points? We present a novel geometrical probabil- 
ity approach to this question which is based on entropic 
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graph methods developed by us and reported in publica- 
tions 

Techniques for manifold learning can be classified 
into three categories: linear methods, local methods, and 
global methods. Linear methods include principal com- 
ponents analysis (PCA) 1 25 1 and multidimensional scaling 
(MDS) 1 12]. They are based on analyzing eigenstructure of 
empirical covariance matrices, and can be reliably applied 
only when the manifold is a linear subspace. Local meth- 
ods include linear local imbedding (LLE) [32 1, locally lin- 
ear projections (LLP) |24|, Laplacian eigenmaps [4|, and 
Hessian eigenmaps 1 16 1. They are based on local approxi- 
mation of the geometry of the manifold, and are computa- 
tionally simple to implement. Global approaches include 
ISOMAP |34 1 and C-ISOMAP EJ. They preserve the 
manifold geometry at all scales, and have better stability 
than local methods. 

We propose a geodesic-minimal-spanning-tree 
(GMST) method for manifold learning that is imple- 
mented as follows. First a complete geodesic graph 
between all pairs of data samples is constructed, e.g. using 
ISOMAP or C-ISOMAP. Then a minimal spanning graph, 
the GMST, is obtained by pruning the complete geodesic 
graph down to a subgraph that still connects all points 
but has minimum total geodesic length. The intrinsic 
dimension and intrinsic a-entropy is then estimated from 
the GMST length functional using a simple linear least 
squares (LLS) and method of moments (MOM) procedure. 

The GMST method falls in the category of global ap- 
proaches to manifold learning but it differs significantly 
from the aforementioned methods. First, it has a differ- 
ent scope. Indeed, unlike ISOMAP and C-ISOMAP, the 
GMST method provides a statistically consistent estimate 
of the intrinsic entropy in addition to the intrinsic dimen- 
sion of the manifold. To the best of our knowledge no 
other such technique has been proposed for learning man- 
ifold dimension. Second, unlike local methods that work 
on chunks of data in local neighborhoods, GMST works on 
chunks of resampled data over the global data set. Third, 
for N samples the GMST method has 0(N log N) com- 
putational complexity as compared with the 0(N 3 ) com- 
plexity of an MDS ISOMAP reconstruction. Fourth, the 
GMST method is simple and elegant: it estimates intrin- 
sic entropy and dimension by detecting the rate of increase 
of a GMST as a function of the number of its resampled 
vertices. 

The aims of this paper are limited to introducing 



GMST as a novel method for estimating manifold dimen- 
sion and entropy of the samples. As in work of others on 
dimension estimation H26I [8) we do not here consider the 
issue of reconstruction of the complete manifold. Simi- 
larly to these authors, we believe that dimension estima- 
tion and entropy estimation for non-linear data are of in- 
terest in their own right. We also do not consider the effect 
of additive noise or outliers on the performance of GMST. 
Finally, the consistency results of GMST reported here are 
limited to domain manifolds defined by some smooth un- 
known mapping. The extension of GMST methodology 
to general target manifolds, e.g. those defined by implicit 
level set embeddings 1291 1281 . is a worthwhile topic for 
future investigation. 

What follows is a brief outline of the paper. We re- 
view some necessary background on the mathematics of 
domain manifolds in Sec. [2] In Sec. [3] we review the 
asymptotic theory of entropic graphs and obtain several 
new results required for their extension to embedded man- 
ifolds. In Sec. @]we define the general GMST algorithm. 
Finally in Sec. |5]we illustrate the GMST approach to es- 
timating intrinsic dimension and entropy of a human face 
dataset. 



2 Background 

2.1 A 3D Example 

To illustrate ideas consider a 2D surface embedded in 
3D Euclidean space, called the embedding space. Let 
{xi,X2, ■ ■ ■} C U C R 2 be a set of points (samples) 
in a subset U of the plane. Naturally, the shortest path 
between any pair (xi, Xj) of these points is given by the 
straight line in M 2 connecting them, with corresponding 
distance given by its Euclidean (L2) length, \xi — Xj\%. 
Now let U be used as a parameterization space to describe 
a curved surface in M 3 via a mapping ip : U — > M 3 . 
Surfaces M = <p(U) defined in this explicit manner are 
called domain manifolds and they inherit the topological 
dimension, equal to 2 in this case, of the parameterization 
space. When <p is non-linear the shortest path on M. be- 
tween points y i = (f(xi) and = ip(xj) is a curve on 
the surface called the geodesic curve. In this paper we will 
primarily consider domain manifolds defined by confor- 
mal mappings ip. Such conformal embeddings have the 
property that the length of paths on the surface are identi- 
cal to lengths of paths in the parameterization space, pos- 
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sibly up to a smoothly varying local scale factor. This 
property guarantees that, regardless of how the mapping 
p "deforms" U onto Ai, the geodesic distances in Ai are 
closely related to the Euclidean distances in U. When this 
smooth surface representation holds there exist algorithms, 
e.g. ISOMAP and C-ISOMAP HIED, which can be used 
to estimate the Euclidean distances between points in U 
from estimates of the geodesic distances between points in 
Ai. If a certain type of minimal spanning graph is con- 
structed using these estimates well established results in 
geometrical probability 1361 1211 allow us to develop sim- 
ple estimates of both entropy and dimension of the points 
distributed on the surface. 



2.2 Differential Geometry Setting 

In the following, we recall some facts from differential ge- 
ometry needed to formalize and generalize the ideas just 
described. We will consider smooth manifolds embedded 
in R d . For the general theory we refer the reader to any 
standard book in differential geometry (for example, |9|. 
1101 . 0). An TO-dimensional smooth manifold Ai C M. d 
is a set such that each of its points has a neighborhood 
that can be parameterized by an open set of M m through 
a local change of coordinates. Intuitively, this means that 
although Ai is a (hyper) surface in R d , it can be locally 
identified withR m . 

Let ip : fi i— ► A4 be a mapping between two mani- 
folds, O, Ai. Let 7 be a curve in O. The tangent map dpx 
assigns each tangent vector v to 12 at point x the tangent 
vector dpxV to Ai at point p(x), such that, if v is the ini- 
tial velocity of 7 in il, then dpxv is the initial velocity of 
the curve p(j) in Ai. For example, if x E U CS1C R m , 
with U an open set of M m , then dpxv = J v (x) v, where 
J v = [dpi/dxj], i = 1, . . . , d, j = 1, . . . , to, is the Jaco- 
bian matrix associated with p at point x 6 Q. 



The length of a smooth curve T : [0, 1] 1— > Ai is de- 
fined as £(T) = f Q |r(i)|di. The geodesic distance be- 
tween points y 0l Vi <G AI is the length of the shortest 
(piecewise) smooth curve between the two points: 



mf{*(T) : T(0) 



y ,r(i) = yi } 



We can now define the following types of embed- 



ding s. 



Definition 1 <p : Q Ai is called a conformal mapping 
if p is a diffeomorphism (i.e., ip is differentiable, bijective 
with differentiable inverse p^ 1 ) and, at each point x £ tt, 
p preserves the angles between tangent vectors, i.e., 



(dp x v) T (dcpxw) = c{x) v T w 



(1) 



for all vectors v and w that are tangent to f2 at x, and 
c(x) > is a scaling factor that varies smoothly with x. 
If for all x £ Q, c(x) — 1, then p is said to be a (global) 
isometry. In this case the length of tangent vectors is also 
preserved in addition to the angles between them. 

It is easy to check that if there is an open set U C C 
R m , then the diffeomorphism p is a conformal mapping iff 
J ip (x) T J v (x) = c(x) l m , where l m is the m x m identity 
matrix. In this case, the geodesic distance in Ai can be 
computed as follows. Any smooth curve T : [0, 1] 1— ► Ai 
can be represented as T(t) — p(-f(t)), where 7 : [0, 1] 1— > 
is a smooth curve in M. m . Then, the length £(T) of the 
curve r is given by 



df 



\Mj(t))j(t)\dt= / V4lW)m)\dt 



e(r) 



As in W n the shortest path between any two points is 
given by the straight line that connects them, j(t) = 

xq + t(xi — xq) minimizes J Q [y(t) \ dt, over all smooth 
curves with start and end points at xq and xi, respectively. 
So, if c(x) = c, for all x G 57, the geodesic distance be- 
tween y = p(x ) and y 1 = p{x\) is 



d M (f(x ), <p(xi)) = c\x Q - x x \ 2 



(2) 



When c = 1, i.e., p is an isometry, the geodesic distance 
in Ai and the Euclidean distance in the parameterization 
space M. m are the same. If c > 1 (c < 1) there is a global 
expansion (contraction) in the distances between points. 

It is evident from the above discussion that geodesic 
distances carry strong information about a non-linear do- 
main manifold such as Ai. However, their computation 
requires the knowledge of the analytical form of Ai via p 
and its Jacobian. In the manifold learning scenario con- 
sidered in this paper this analytical form is assumed un- 
known and, instead, we are given a finite set of data points 
lying on the smooth m-dimensional manifold Ai, with m 
also considered unknown. In order to reconstruct a domain 



3 



Step 1. Determine a Euclidean neighborhood 
graph G of the observed data y n accord- 
ing to the e-rule or the /c-rule as defined in 
ISOMAP 0. 

Step 2. For isometric embeddings compute the 
edge matrix £ of the ISOMAP graph J34) 
and for conformal imbeddings compute the 
edge matrix £ of the C-ISOMAP graph 
1151 . The entry of this symmetric 

matrix is the sum of the lengths of the 
edges in G along the shortest path between 
the pair of vertices (Yi,Yj) where the 
edge lengths between neighboring points 
Y\, Y~2 in G are defined as Euclidean dis- 
tance 1^1-1 ^2 1 in the ca se of ISOMAP or 
\Yi-Y 2 \/ y/M(l)M{2) in the case of C- 
ISOMAP where M(i) is the mean distance 
of Y i to its immediate nearest neighbors. 



Table 1: First two steps of the ISOMAP/C-ISOMAP al- 
gorithms to reconstruct Euclidean distances between X n 
on the embedding parameterization space from points y n 
over the embedded manifold 



manifold along with its parameterization we need to esti- 
mate the geodesic distances between pairs of data points 
in Ai and the respective Euclidean distances betweem pre- 
images of these data points in the parameterization space 
U. 

When Ai is an isometric embedding the ISOMAP al- 
gorithm 1 34 1 obtains such a reconstruction from a finite set 
of samples through estimation of the pairwise geodesic dis- 
tances. This estimate is computed from a Euclidean graph 
G connecting all local neighborhoods of data points in Ai. 
Specifically, ISOMAP proceeds as follows. Two methods, 
called the e-rule and the fc-rule |34|, have been proposed 
for contructing G. The first method connects each point 
to all points within some fixed radius e and the other con- 
nects each point to all its /c-nearest neighbors. The graph 
G defining the connectivity of these local neighborhoods 
is then used to approximate the geodesic distance between 
any pair of points as the shortest path through G that con- 
nects them. This results in an edge matrix whose en- 
try is the geodesic distance estimate for the (i,j)-th pair of 
points. Finally, ISOMAP obtains a smooth reconstruction 
of the manifold by applying the classical Multidimensional 
Scaling (MDS) method 1121 to the edge matrix. 



Steps one and two of ISOMAP are motivated by the 
fact that locally any smooth manifold is approximately 
"flat" and, so, the distances between neighboring points 
are well approximated by their Euclidean distances. For 
faraway points, the geodesic distance is estimated by sum- 
ming the sequence of such local approximations over the 
shortest path through the graph G. In Q it was proved 
that, when the data are random samples from a contin- 
uous distribution on the manifold Ai, the first two steps 
of ISOMAP recover the true geodesic distances with high 
probability if the data points form a sufficiently "dense" 
sampling of Ai and if Ai is free of "holes." When Ai is a 
global isometric embedding in R d , the estimated geodesic 
distances are also an estimate of distances in W n and the 
ISOMAP succeeds in its task of manifold reconstruction. 
For other types of embeddings, there is no guarantee that 
the ISOMAP will recover the correct parameterization. In 
(141 . a variant of this algorithm, called C-ISOMAP, was 
proposed to deal with the more general class of conformal 
embeddings. 

With regards to estimation of the intrinsic dimension 
m several methods have been proposed 1 25 1 . Most of these 
methods are based on linear projection techniques: a linear 
map is explicitly constructed and dimension is estimated 
by applying Principal Component Analysis (PCA), factor 
analysis, or MDS to analyze the eigenstructure of the data. 
These methods rely on the assumption that only a small 
number of the eigenvalues of the (processed) data covari- 
ance will be significant. Linear methods tend to overesti- 
mate m as they don't account for non-linearities in the data. 
Both nonlinear PCA (23 methods and the ISOMAP cir- 
cumvent this problem but they still rely on unreliable and 
costly eigenstructure estimates. Other methods have been 
proposed based on local geometric techniques, e.g., esti- 
mation of local neighborhoods 1 35 1 or fractal dimension 
1 8 1, and estimating packing numbers 1 26 1 of the manifold. 

3 Entropic Graph Estimators on 
Embedded Manifolds 

Let y n = Y"i, . . . , Y n be n independent identically dis- 
tributed (i.i.d.) random vectors in [0, l] d , with multivariate 
Lebesgue density /, which we will also call random ver- 
tices. Define the edge matrix £ as the nxn matrix of edge 
lengths (w.r.t. a specified metric) between pairs of vertices. 
A spanning graph T over y n is defined as the pair {V, E} 
where V — y n and E is a subset of edges from £ which 



4 



connect the vertices V. When £ is computed from pair- 
wise Euclidean distances T is called a Euclidean spanning 
graph. 

It has long been known 1 3 1 that, when suitably nor- 
malized, the sum of the edge weights of certain minimal 
Euclidean spanning graphs T over y n converges almost 
surely (a.s.) to the limit (3d J Rd f a (y)dy where where the 
integral is interpreted in the sense of Lebesgue, a G (0, 1) 
and (3d > 0. This a.s. limit is the integral factor J f a 
in what we will call the extrinsic Renyi a-entropy of the 
multivariate Lebesgue density /: 



log / r{y)dy 



(3) 



In the limit, when a — > 1 we obtain the usual Shannon 
entropy, — J Rd f(y) log f(y)dy. Graph constructions that 
converge to the integral in the limit were called continu- 
ous quasi-additive (Euclidean) graphs in 1 36 1 and entropic 
(Euclidean) graphs in \ 21\. See the monographs by Steele 
1331 and Yukich [ 36 1 for an excellent introduction to the 
theory of such random Euclidean graphs. 

The a-entropy has proved to be an important quan- 
tity in signal processing, where its applications range from 
vector quantization 1181 1311 to pattern matching [22 1 and 
image registration 1211 1191 . The a-entropy parameterizes 
the Chernoff exponent governing the minimum probability 
of error II II making it an important quantity in detection 
and classification problems. Like the Shannon entropy, the 
a-entropy also has an operational characterization in terms 
of source coding rates. In 1131 it was shown that the a- 
entropy of a source determines the achievable block-code 
rates in the sense that the probability of block decoding er- 
ror converges to zero at an exponential rate with rate con- 
slant 



3.1 Beardwood-Halton-Hammersley Theo- 
rem in M. d 

A remarkable result in geometrical probability was estab- 
lished by Beardwood, Halton and Hammersley almost half 
a century ago [|3j. Let y n = {Y\, . . . , Y n } be a set of 
points in ~M. d . A minimal Euclidean graph spanning y n is 
defined as the graph spanning y n having minimal overall 
length 

<(3y = mm5>p. (4) 



Here the sum is over all edges e in the graph T, |e| is the 
Euclidean length of e, and 7 6 (0, d) is called the edge 
exponent or power-weighting constant. For example when 
T is the set of spanning trees over y n one obtains the MST 
A minimal Euclidean graph is continuous quasi-additive 
when it satisfies several technical conditions specified in 
1361 (also see 1231 ). Continuous quasi-additive Euclidean 
graphs include: the minimal spanning tree (MST), the k- 
nearest neighbors graph (fc-NNG), the minimal matching 
graph (MMG), the traveling salesman problem (TSP), and 
their power-weighted variants. While all of the results in 
this paper apply to this larger class of minimal graphs we 
specialize to the MST for concreteness. 

Beardwood-Halton-Hammersley (BHH) Theorem 

13311361 : Let y n be an i.i.d. set of random variables taking 
values in W l having common probability distribution P. 
Let this distribution have the decomposition P = F + Q 
where F is the Lebesgue continuous component and Q is 
the singular component. The Lebesgue continuous compo- 
nent has a Lebesgue density (no delta functions) which is 

denoted f(x), x E R d . Let \y n ) be the length of the 
MST spanning y n and assume that d > 2 and < 7 < d. 
Then 



(y n )/n a 



Aj / f a (y)dy (a.s.), 



(5) 



where a — (d — ^)/d and (3d is a constant not depend- 
ing on the distribution P. Furthermore, the mean length 

E[L^ (y n )]/n a converges to the same limit. 

The limit on the right side of Q in the BHH theorem 
is zero when the distribution P has no Lebesgue continu- 
ous component, i.e., when F = 0. On the other hand, when 
P has no singular component, i.e., Q = 0, a consequence 
of the BHH Theorem is that 
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L 

log 



n (d-j)/d 



log A, 



(6) 



is an asymptotically unbiased and strongly consistent esti- 
mator of the extrinsic a-entropy (/) defined in (|3j- 

3.2 Generalization of BHH Thm. to Embed- 
ded Manifolds 

If the vertices y n = {Y±, . . . , Y n } are constrained to 
lie on a smooth m-dimensional manifold M C [0, 
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the distribution of Y i is singular with respect to Lebesgue 
measure, F = 0, and, as previously mentioned, the limit 
in the BHH Theorem is zero. However, as shown be- 
low, if M is defined by an isometric embedding from the 
parameterization space R m , if Y % has a continuous den- 
sity / on A4, and if ISOMAP is used to approximate the 
geodesic edge matrix, then the length of an MST con- 
structed from the geodesic edge matrix can be made to 
converge, after suitable normalization and transformation, 
to the intrinsic a-entropy H^ 4 (/) on M. defined by 



one 



f x (x)dx + o(n im ~^/ m ), (9) 



where fx is the density of Xi = tp~ l (Yi), Therefore the 
limits claimed in © for d < m and d > to are obvious. 
For d = rn the relation (|9} implies 



lira L^(X n )/n^ m -^/ m =f3 ri 



f x (x)dx, (10) 



H^(f) = -log [ f a (y)»M(dy), 
7 Jm 



(7) and it remains to show that this limit is identical to the limit 
asserted in JS}. 



where HM_{dy) denotes the differential volume element 
over Ai. 

More generally, assume that Ai is embedded in [0, l] d 
through the diffeomorphism p. As Xi = p^ 1 (Y \) lives 
in R m , let T be the Euclidean minimal graph spanning X n 



For an integrable function F defined on a domain 
manifold Ai defined by the diffeomorphism ip : W n 
Ai, the integral of F over Ai satisfies the relation 1 10 1: 



and having length function Uz (X n ) = IA 



F{y)^ M (dy) 



M 



F(<p(x))g(x)dx, (11) 



according to definition @. We have the following exten- 
sion of the BHH Theorem. 



Theorem 1 Let Ai be a smooth compact m-dimensional 
manifold embedded in [0, l] d through the diffeomorphism 
ip : M. m i— > Ai. Assume 2 < m < d and < 7 < to. Sup- 
pose that Y 1, Y 2-, ■ ■ ■ are i.i.d. random vectors on Ai hav- 
ing common density f with respect to Lebesgue measure 
fiM on Ai. Then, the length functional (</? -1 (3^n)) of 
the MST spanning <p~ l {y n ) satisfies 



where g(x) = ydet (J^J V ). Specializing F to the in- 
dicator function of a small volume centered at a point y 
il Q implies the following relation between volume ele- 
ments in Ai and K m : HM(dy) = g(x) dx. Furthermore, 
specializing to F(y) = f(y) it is clear from (II Q that 
fx{x) = f(ip(x))g(x). Therefore 



lim L^{ V -\y n ))/^ d 
00, 



d < 



f x (x)dx = I (f(ip(x))g(x)) a dx 

rMx))g a -\x)) g(x)dx, 

(8) 

which, after the change of variable x 1— ► <p{x), is equiva- 
171 lent to the integral in the limit ©. □ 



in 



< (3 m f M [det(jTJ v )] 2 f a {y)ii M {dy), d' 

0, d > m 



(a.s.) where a — (to — 7) jm. Furthermore, the 

mean E [L: 
same limit. 



mean E[L^ m (ip- 1 (y n ))]/n ( - d -^/ d converges to the 



This theorem is a simple consequence of the relation 
(|5} in the BHH Theorem and properties of integrals over 
manifolds. 



Our goal is to learn the entropy of non-linear data on 
a domain manifold together with its intrinsic dimension, 
given only the data set y n of n samples in the embedding 
space R d , and without knowledge of its embedding func- 
tion <p. If tp is an isometric or conformal embedding then 
it has been shown that for sufficiently dense sampling over 
M, i.e., for large n, the ISOMAP or the C-ISOMAP al- 
gorithm summarized in Tabled will approximate the ma- 
trix of pairwise Euclidean distances between the points 
Xn — <fi~ 1 (yn) in the domain space W n without explicit 
knowledge of ip. Thus if one uses this edge matrix to con- 
struct a MST over y n its length function will approximate 

(p~ 1 (y n )) and we can invoke Thm. \l\to characterize 



Proof of Thm. [7J By the BHH Theorem, with probability its asymptotic convergence properties. As the edge matrix 
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will contain approximations to the geodesic distances be- 
tween pairs of points (3^,!Vj) this graph will be called a 
geodesic MST (GMST). 

More specifically, assume that the embedding of M is 
isometric (conformal) and denote by Em the edge matrix 
Em over the points y n constructed by the ISOMAP (C- 
ISOMAP) algorithm 0Q3) as described in Tabled Define 
the geodesic MST T as the minimal graph over y n whose 
length is: 

L?(y n )= mjn5>|X,, d2) 

where |e|x ranges over the n 2 entries |ey|.M of the edge 
matrix E M computed by ISOMAP (C-ISOMAP). 

The following is the principal theoretical result of this 
paper and is a simple consequences of Thm. ^ 

Theorem 2 Let M be a smooth m- dimensional manifold 
embedded in [0, l] d through a conformal map <p : W n i— > 
M. Let 2 < m < d and < 7 < m. Suppose that 
Y"i, . . . , Y n are i.i.d. random vectors on M. with common 
density f w.r.t. Lebesgue measure fiM on M. Assume that 
each of the edge lengths \&ij\M in the edge matrix Em 
converge a.s. to \<p (Y {) — <p (Y as n — > 00. Then, 
the length functional of the GMST satisfies 

lim L^(y n )/n^ d '-^/ d ' -> (13) 
00, d < m 

' P m J M f a (y)g~'< /d (^ 1 (y))vM(dy), d' = m 

0, d > m 

(a.s.) where a = (m — 7) /m and g(x) ^= »J det 

Furthermore, the mean E[L^{y n )]/n { - d -^ /d converges 
to the same limit. 

Proof of Thm. |2] 

First express the normalized length functional 

L^(y n )/n ( - d '-^/ d as 

L^{y n )/^ d '-^i d = Lf(< P - 1 (y n ))/n( d -iy d 

■ 'L^{y n )/L^^-\y n )) 



By Thm. ^ me fi rst factor on the right converges (a.s.) 
to the the limit (|8}. Since the edges lengths used to con- 
struct L^ 4 (y n ) converge a.s. to the edge lengths used 
to construct L^ m (ip^ 1 (y n )) the term in brackets con- 
verges (a.s.) to 1. Hence the normalized length func- 
tional L J ^{y n )jn { ~ d ~^l d converges (a.s.) to the the limit 
10. By identifying (a — 1) = — 7/d, x = (fi^ 1 (y) and 
det = g((p~ 1 (y)), for d = m the integrand on 

the right of the limit (|8} is equivalent to: 

r(y) [det (^)]^ =/«(!/) [g(<p- l (y))}-* . 

□ 

If m > 2, as the parameter d is increased from 2 to 
00 the limit (II 3I > in Thm. [2] transitions from infinity to a 
finite limit and finally to zero over three consecutive steps 
d = m — 1, m, m + 1. As d indexes the rate constant 
n (d -i)/d of tne length functional L^ t (y n ), this abrupt 
transition suggests that the intrinsic dimension m and the 
intrinsic entropy might be easily estimated by investigat- 
ing the convergence rate of the GMST's length functional. 
This observation is the basis for the estimation algorithm 
introduced in the next section. 

We now specialize Theorem[2]to the following cases 
of interest. 



3.2.1 Isometric Imbeddings 

In the case that <p defines an isometric imbedding the 
ISOMAP algorithm is asymptotically able to recover the 
true Euclidean distances between the points in X n = 
f^ 1 {yn)- Thus the assumption of Thm. |2]is satisfied. Fur- 
thermore, J^J V — l m . Thus, for example, when L^ 4 (y n ) 
is the length of the geodesic MST constructed on the edge 
matrix generated by the ISOMAP algorithm, the limit Jl 31 
holds with the d = m limit replaced by 

0m I f a (y)VM(dy). 
Jm 

Furthermore, m/7 log (t^ 1 (X0/« (m_7)/m ~ lo S Z 3 ™) 
•converges a.s. to the intrinsic entropy |7}. 
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3.2.2 Isometric Imbeddings 
tion/Expansion 



with 



Contrac- 



In the case that tp defines an isometric imbedding with con- 
traction or expansion the C-ISOMAP algorithm is able to 
recover the true Euclidean distances between points in X n . 
Furthermore, J^J V = c l m where c is a constant. Thus, 
when L^ 4 (y n ) is the length of the geodesic MST con- 
structed on the edge matrix generated by the C-ISOMAP 
algorithm the limit Jl 3i holds with the d = m limit re- 
placed by 



A, 



.-7/2 



f a (y) f*M(dy). 



M 



con- 



Now m/ 7 log (L^(y„)/n (m - 7)/m - log A, 

verges a.s. up to an unknown additive constant —7/2 logc 
to the intrinsic entropy 0. We point out that in many 
signal processing applications (e.g. image registration) 
a constant bias on the entropy estimate does not pose a 
problem since an estimate of the relative magnitude of the 
entropy functional is all that is required. 



3.2.3 Non-isometric Imbeddings Denned by Confor- 
mal Mappings 



Initialize: Using entire database of 
signals y n construct geodesic distance 
matrix £ M using ISOMAP or C-ISOMAP. 
Select parameters : 
Po, Pi (Po<Pi<n), and N (N > 0) 

foz_p = p 1 ,...,p Q 
L = Q 

for N' = 1, ... ,N 

Randomly select a subset of p signals y p frc 
Compute geodesic MST length L p over y p 
L = L + L p 
end for 

Compute sample average geodesic MST length 
E[L?(y p )] = L/N 
end for 

Estimate m and H"(f) from {E[L^(y p )]} p p g pi 



Table 2: GMST resampling algorithm for estimating in- 
trinsic dimension m and intrinsic entropy H„ . 

a-entropy. However, as can be seen from the discussion in 
the next section, as the rate exponent of the GMST length 
depends on m we can still perform dimension estimation 
in this case. 



In the case that tp is a general (non-isometric) conformal 
mapping the C-ISOMAP algorithm is once again able to 
recover the true Euclidean distances between points in X n . 
Furthermore, J^J^ = c(x) l m . Thus, when L^ 4 (y n ) is 
the length of the geodesic MST constructed on the edge 
matrix generated by the C-ISOMAP algorithm, the limit 
(II 31 holds with the d = m limit replaced by 

Pm [ f a (y)c-^ 2 ( ( p- 1 (y)) f x M (dy). 

J M 

In this case m/7log (t^ l (y n )/n ( - m -^/ m - log/3 m ) 

converges a.s. up to an additive constant to the weighted 
intrinsic entropy 

^— log / f a (y) c-~>/ 2 (<p-\y))vM(y) ■ 

The weighted a-entropy is a "version" of the standard un- 
weighted a-entropy H^ 4 (/) which is "tilted" by the space- 
varying volume element of M.. This unknown weighting 
makes it impossible to estimate the intrinsic unweighted 



3.2.4 Non-conformal Diffeomorphic Imbeddings 

When tp defines a general diffeomorphic embedding a re- 
sult analogous to Thm. [2 easily follows giving an identi- 
cal limiting relation to © except that L^ 4 (y n )/n (d ~ 7)/d 
converges to 

(3 m [ f a (y) [det (J? J tp )]- y/M fi M (dy), 
Jm 

when d = m. However, without an extension of the C- 
ISOMAP algorithm that can provably learn the Euclidean 
distances between the points X n in the parametrization 
space, Thm. |2]is not applicable. To the best of our knowl- 
edge such an extension of C-ISOMAP does not yet exist. 



4 GMST Algorithm 

Now that we have characterized the asymptotic limit Jl 3I > 
of the length function of the GMST we here apply this the- 
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ory to jointly estimate entropy and dimension. The key is 
to notice that the rate of convergence is strongly depen- 
dent on m while the rate constant in the convergent limit 
is equal to the intrinsic a-entropy. We use this strong rate 
dependence as a motivation for a simple estimator of m. 
Throughout we assume that the geodesic minimal graph 
length L^ 4 (y n ) is determined from an edge matrix £m 
that satisfies the assumption of Thm.EJ e.g., obtained using 
ISOMAP or C-ISOMAP. We set the edge power weighting 
in L^ 4 (y n ) to 7 = 1 and assume that to > 2. This guar- 
antees that L^ 4 {y n )l n^ d ~ 7 )/ d has a non-zero finite con- 
vergent limit for d = to. Next define l n = log (y n ) . 
According to Jl 31 l n has the following approximation 



l n = a log n + b + e„, 



where 



b = 



(to — 7)/to, 
logf3 m + 7 /mH^(f), 



(14) 



(15) 



a = (to — 7) /to and e n is an error residual that goes to 
zero a.s. as n — > 00. 

The additive model J14I could be the basis for many 
different methods for estimation of to and H. For exam- 
ple, we could invoke a central limit theorem on the MST 
length functional 1 1 1 to motivate a Gaussian approximate 
to e n and apply maximum likelihood principles. How- 
ever, in this paper we adopt a simpler non-parametric least 
squares strategy which is based on resampling from the 
population y n of available points in Ai. The algorithm 
is summarized in Table EJ Specifically, let p%, ■ . ■ ,pq, 
1 < pi < . . . ,< pq < n, be Q integers and let 
N be an integer that satisfies N/n = p for some fixed 
p 6 (0, 1]. For each value of p £ {pi, . . . ,pq} gener- 
ate N independent samples yi, j = 1, . . . , N and from 
these samples compute the empirical mean of the GMST 
length functionals L p = A" 1 ~£f=i ^0^)- Defining 
7 = [logL pl , . . . ,logL Pl ] T , and motivated by dl4i we 
write down the linear vector model 



I = A 



(16) 



where 



.4 



\ogpx 
1 



logPQ 

1 



Expressing a and b explicitly as functions of to and H a 
via \\5\ . the dimension and entropy quantities could be 



estimated using a combination of non-linear least squares 
(NLLS) and integer programming. Instead we take a sim- 
pler method-of-moments (MOM) approach in which we 
use d!6l > to solve for the linear least squares (LLS) esti- 
mates a, b of a, b followed by inversion of the relations 
H5\ . After making a simple large n approximation, this 
approach yields the following estimates: 



= L7/(i-a)J 



H. 



M 



7 



log A; 



It is easily shown that the law of large numbers and Thm. 
12 imply that this estimator is consistent as n — > 00. We 
omit the details. 

A word about determination of the sequence of con- 
stants {f3 m } m is in order. First of all, in the large n regime 
for which the above estimates were derived, (3 m is not re- 
quired for the dimension estimator. [3 m is the limit of the 
normalized length functional of the Euclidean MST for a 
uniform distribution on the unit cube [0, l] m . Closed form 
expressions are not available but several approximations 
and bounds can be used in various regimes of to 1361 El. 
Another possibility is to determine f3 m by simulation of 
the Euclidean MST length on the m-dimensional cube for 
uniform random samples. In our simulations, described be- 
low, we have used the large to approximation of Bertsimas 
andvanRyzin |6j: log/3 m ~ 7/2 log(TO,/27re). 

Before turning to the application we briefly discuss 
computational issues. We have developed a custom im- 
plementation of the MST algorithm which is a modifica- 
tion of Kruskal's algorithm |30|. This implementation im- 
plements an efficient disk radius algorithm to restrict the 
search space yielding substantial runtime speedup. This 
has allowed us to routinely implement the MST on tens of 
thousands of points. 



5 Application 

We performed several preliminary validation tests of the 
GMST estimator on simulated data including: a linear 
manifold and the swiss roll manifold investigated in |34|. 
Due to space limitations we will not present results from 
these validation tests. Rather we will present a very simple 
example to illustrate the applicability of GMST intrinsic 
dimension and entropy estimates. For this purpose we in- 
vestigated a set of black-and-white images of several indi- 
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viduals taken from the Yale Face Database B 1 17 1. This 
is a publicly available database containing face images 
of 10 subjects with 585 different viewing conditions for 
each subject. These consist of 9 poses and 65 illumination 
conditions (including ambient lighting). The images were 
taken against a fixed background which we did not bother 
to segment out. We think this is justified since any fixed 
structures throughout the images would not change the in- 
trinsic dimension or the intrinsic entropy of the dataset. We 
randomly selected 3 individuals from this data base and 
subsampled each person's face images down to a 64 x 64 
pixel image. The pixels in each of the images were lexi- 
cographically reordered into vectors residing in a 4096 di- 
mensional space. 

We studied the dimension and entropy of each per- 
son's face as follows. We first generated the Euclidean 
nearest neighbor graph G used by ISOMAP in Step 1 (see 
Tabled for each of the three sets of 585 images. We then 
investigated the trajectory of the mean GMST as a func- 
tion of n for each person's face folio. Specifically 26 x 25 
random samples (with replacement) were selected to form 
26 resampled face subsets of sizes ranging from 100 to 
585, respectively. Step 2 of the ISOMAP algorithm was 
then implemented on each sample to generate 650 differ- 
ent edge matrices. Subsequently the GMST was computed 
from each of these edge matrices and for each of the 26 
folio sizes the 25 resampled GMST length functions were 
averaged to obtain 3 average GMST length sequences over 
n. In the GMST implementation the edge exponent 7 was 
fixed at a value of 1 . 

In Fig. [Qthe sequence of average GMST length func- 
tional is plotted for each of the three faces. The symbols 
denote the locations of the 26 values of n chosen for study 
and the corresponding values of the average GMST length. 
Note that the average GMST length sequences appear to 
increase almost linearly over n for each of the three per- 
sons, albeit with different rate constants. However, after 
a log-log transformation, shown in Fig. |2] it becomes evi- 
dent that the linear model for the of the mean GMST length 
functional is not valid for small n. Fig. |3]is a blowup of 
Fig. |2]for n > 500 and experimentally confirms the large- 
n linear behavior predicted by Thm. |2] and supports the 
validity of the linear model d!4t . 

Using the average GMST length sequences we next 
estimated slope and intercept parameters a, b of the linear 
model and implemented the MOM estimator of dimension 
and entropy as described in the previous section. Only the 
range n > 500 was used in fitting the linear model. The 



x1Q 5 GMST Length Fuctional 
8 1 1 1 1 




100 200 300 400 500 600 

n 



Figure 1 : The average geodesic MST growth rates for three 
different face images in the Yale face database B. 
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log n 



Figure 2: Log-log plot of Fig. ^ 



10 



GMST Length Fuctional 




6.3 6.35 
log n 



Figure 3: Blowup of Fig. 13 showing linearity of geodesic 
MST growth rates for large n. 





Facel 


Face2 


Face 3 


fa 


6 


5 


6 


H (bits) 


70.4 


68.8 


73.8 



Table 3: Dimension estimates fa and entropy estimates H 
for three faces in the Yale Face Database B. 



MOM estimator of ra was rounded to the nearest integer 
and the parameter f3 m was estimated by the large m ap- 
proximation 1 6 1 . The results are summarized in Table [5] 
As a result of this procedure the estimated face dimension 
m was observed to vary between 5 and 6 for each of the 
individuals. The intrinsic entropy estimate expressed in 
log base 2 was concentrated around 70 bits. Note that as 
a = (m — l)/m is close to one for these estimated values 
of m the estimates of a-entropy are expected to be close to 
the Shannon entropy. These entropy estimates suggest that 
one should be able to get away with a model incorporating 
at most 6 parameters to describe the range 585 poses and 
illuminations of any of the three faces. An MDS ISOMAP 
analysis of the same three faces gave slightly higher esti- 
mates of dimension, varying between 6 and 7. 



6 Conclusion 



We have presented a novel method for intrinsic dimension 
estimation and entropy estimation on smooth domain man- 
ifolds. With regards to intrinsic dimension estimation, the 



method proposed has two main advantages. First, it is 
global in the sense that tyhe MST is constructed over the 
entire and we thus avoid local linearizations. Second, un- 
like previous methods it simple to implement and does not 
require tuning any user-defined parameters such as eigen- 
value thresholds or sizes of local neighborhoods. The 
GMST methods described in this paper are currently be- 
ing applied to a large number of dimension reduction and 
entropy characterization problems including: gene cluster- 
ing in bioinformatics, Internet traffic analysis, lung nodule 
classification, and radar signature analysis. 
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